MONet Community Science Meeting 2023
This RMarkdown document contains code and scripts for preliminary processing and data visualization of soil biogeochemistry data. These graphs were created for use in Session 3 (Biogeochemistry Data Tutorial) on Tuesday, November 7.
USING THIS TUTORIAL
Download the bgc_data file and bgc_analyses file. Create a new R Script file and save it in the same folder as the data files. Make sure all these files are saved in the same folder, to minimize errors with file paths.
This file is a Markdown report, which includes text as well as code chunks. You can copy the R code directly from these chunks and paste it into your R Script file.
You will need the {tidyverse} set of packages to run
this tutorial. The tidyverse includes packages like
{dplyr} and {tidyr} for data wrangling, and
{ggplot2} for data visualization.
library(tidyverse) # for general data wrangling and visualization
theme_set(theme_bw(base_size = 12)) # set default ggplot theme
library(ggcorrplot) # for correlation matrix plot
library(ggtern) # for soil texture triangle (ternary plot)
You may also want to download additional packages to jazz up your
figures, such as the {soilpalettes} package, which includes
soil-themed color palettes.
# Install `soilpalettes` using:
# install.packages("devtools"); devtools::install_github("kaizadp/soilpalettes")
library(soilpalettes) # for color palettes
We need two files for this tutorial. bgc_data includes
biogeochemical variables, and also has site metadata such as
latitude-longitude and biome type. bgc_analyses includes
additional information about the analyses performed.
You will need to change the file path before using this tutorial. The
current file path reflects the directory structure in the GitHub
repository.
If you downloaded the tutorial and the data file into the same folder,
change the code chunk below to
bgc_data = read_csv("bgc_data.csv") and
`bgc_analyses = read_csv("bgc_data.csv") %>% ...
bgc_data = read_csv("Biogeochem/data/bgc_data.csv")
bgc_analyses = read_csv("Biogeochem/data/bgc_analyses.csv") %>% dplyr::select(analysis, analysis_type)
We will use the data in two forms - wideform (bgc_data
is already in wideform, with each analyte as a separate column) and
longform (see below).
data_long =
bgc_data %>%
pivot_longer(cols = -c(Site_Code, Location, Lat, Long, biome_name),
names_to = "analysis") %>%
left_join(bgc_analyses)
One additional processing step before we start the tutorial – The
Location column refers to the depth of the soil sample,
“TOP” or “BTM”.
By default, R sorts its data alphabetically. Change the order here,
before proceeding.
bgc_data =
bgc_data %>%
mutate(Location = factor(Location, levels = c("TOP", "BTM")))
Basic jitter plots to determine the range and distribution of each variable. Grouped by analysis type.
data_long %>%
ggplot(aes(x = analysis, y = value, color = Location))+
geom_jitter(width = 0.3)+
facet_wrap(~analysis_type, scales = "free")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The samples were collected from sites spread across five biomes
## Create summary of biome counts
biome_numbers =
data_long %>%
distinct(Site_Code, biome_name) %>%
group_by(biome_name) %>%
dplyr::summarise(n = n()) %>%
force()
biome_numbers %>% knitr::kable()
| biome_name | n |
|---|---|
| Deserts & Xeric Shrublands | 13 |
| Mediterranean Forests, Woodlands & Scrub | 1 |
| Temperate Broadleaf & Mixed Forests | 21 |
| Temperate Conifer Forests | 19 |
| Temperate Grasslands, Savannas & Shrublands | 11 |
| NA | 5 |
Use this function to set up the correlation plots for the entire
dataset.
We use the ggcorrplot package for this.
Creating a correlation matrix plot can be complicated. We have
simplified this by making a custom function
fit_correlations_function. This bundles all the
processing/calculation steps into a neat function, which you can then
run below.
fit_correlations_function = function(dat, TITLE = ""){
num =
dat %>%
dplyr::select(where(is.numeric)) %>%
drop_na()
num_clean =
num %>%
rownames_to_column("row") %>%
pivot_longer(-row) %>%
separate(name, sep = "_", into = c("name")) %>%
pivot_wider() %>%
dplyr::select(-row)
m = cor(num_clean)
p.mat <- ggcorrplot::cor_pmat(num_clean)
ggcorrplot::ggcorrplot(m, type = "lower",
p.mat = p.mat,
outline.color = "black",
# lab = TRUE,
insig = "blank",
colors = c("#E46726", "white", "#6D9EC1"),
title = TITLE)
}
Now, apply the fit_correlations_function function to the
dataset to generate the correlation plot.
# this will generate a correlation matrix for the entire BGC dataset
corr_all = fit_correlations_function(bgc_data %>% dplyr::select(-Lat, -Long, -GWC))
# However, there are a lot of variables here.
# So, for an easier plot, subset select variables and then re-run the correlation matrix.
data_subset = bgc_data %>%
dplyr::select(Ca, Mg, CEC,
Clay, SO4S, NH4N,
TC, TN, Location)
fit_correlations_function(data_subset %>% filter(Location == "TOP"))
Individual Correlations
Here is simple code to visualize correlations among variables. You can also do this in the Shiny App.
bgc_data %>%
ggplot(aes(x = TC, y = TN))+
geom_point()+
geom_smooth(method = "lm", se = F)+
labs(x = "Total C (%)",
y = "Total N (%)")
bgc_data %>%
ggplot(aes(x = Clay, y = Sand))+
geom_point()+
geom_smooth(method = "lm", se = F)+
labs(x = "Clay (%)",
y = "Sand (%)")
bgc_data %>%
ggplot(aes(y = CEC, x = Clay))+
geom_point()+
geom_smooth(method = "lm", se = F)+
labs(y = "Cation Exchange Capacity (meq/100g)",
x = "Clay (%)")
bgc_data %>%
ggplot(aes(x = Ca, y = Bases))+
geom_point()+
geom_smooth(method = "lm", se = F)+
labs(y = "Total bases (meq/100g)",
x = "Calcium (meq/100g)")
bgc_data %>%
ggplot(aes(x = TC, y = WEOC))+
geom_point()+
geom_smooth(method = "lm", se = F)+
labs(x = "Total carbon (%)",
y = "Water-extractable organic carbon (mg/g)")
These are “jittered” scatter plots showing distribution of the
variables by depth or by biome.
Also included is code to calculate the analysis of variance (ANOVA)
TC ANOVA
lm((TC) ~ Location * biome_name, data = bgc_data) %>% car::Anova()
## Anova Table (Type II tests)
##
## Response: (TC)
## Sum Sq Df F value Pr(>F)
## Location 151.92 1 24.2377 2.9e-06 ***
## biome_name 92.18 4 3.6768 0.007448 **
## Location:biome_name 61.48 4 2.4521 0.049940 *
## Residuals 714.55 114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot TC by depth
bgc_data %>%
ggplot(aes(x = Location, y = TC, color = Location))+
geom_boxplot(position = position_dodge(width = 0.5), width = 0.3, outlier.colour = NA)+
geom_jitter(size = 3, width = 0.3)
plot TC by biome
bgc_data %>%
ggplot(aes(x = biome_name, y = TC, color = Location))+
geom_boxplot(position = position_dodge(width = 0.5), width = 0.3)+
geom_point(size = 3, position = position_dodge(width = 0.5))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
labs(x = "")
CEC ANOVA
lm((CEC) ~ Location * biome_name, data = bgc_data) %>% car::Anova()
## Anova Table (Type II tests)
##
## Response: (CEC)
## Sum Sq Df F value Pr(>F)
## Location 100.0 1 0.7992 0.3732
## biome_name 3283.7 4 6.5619 8.661e-05 ***
## Location:biome_name 466.0 4 0.9311 0.4486
## Residuals 14261.9 114
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot CEC by depth
bgc_data %>%
ggplot(aes(x = Location, y = CEC, color = Location))+
geom_boxplot(position = position_dodge(width = 0.5), width = 0.3, outlier.colour = NA)+
geom_jitter(size = 3, width = 0.3)+
labs(x = "")
plot CEC by biome
bgc_data %>%
ggplot(aes(x = biome_name, y = CEC, color = Location))+
geom_boxplot(position = position_dodge(width = 0.5), width = 0.3)+
geom_point(size = 3, position = position_dodge(width = 0.5))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
labs(x = "")
Clay ANOVA
lm((Clay) ~ Location * biome_name, data = bgc_data) %>% car::Anova()
## Anova Table (Type II tests)
##
## Response: (Clay)
## Sum Sq Df F value Pr(>F)
## Location 483.7 1 5.2195 0.02412 *
## biome_name 2574.6 4 6.9449 4.686e-05 ***
## Location:biome_name 88.4 4 0.2385 0.91607
## Residuals 10936.1 118
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot Clay by depth
bgc_data %>%
ggplot(aes(x = Location, y = Clay, color = Location))+
geom_boxplot(position = position_dodge(width = 0.5), width = 0.3, outlier.colour = NA)+
geom_jitter(size = 3, width = 0.3)+
labs(x = "")
plot Clay by biome
bgc_data %>%
ggplot(aes(x = biome_name, y = Clay, color = Location))+
geom_boxplot(position = position_dodge(width = 0.5), width = 0.3)+
geom_point(size = 3, position = position_dodge(width = 0.5))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
labs(x = "")
WEOC ANOVA
lm((WEOC) ~ Location * biome_name, data = bgc_data) %>% car::Anova()
## Anova Table (Type II tests)
##
## Response: (WEOC)
## Sum Sq Df F value Pr(>F)
## Location 0.16630 1 12.7321 0.0005222 ***
## biome_name 0.14170 4 2.7122 0.0333182 *
## Location:biome_name 0.05427 4 1.0387 0.3903703
## Residuals 1.52822 117
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot WEOC by depth
bgc_data %>%
ggplot(aes(x = Location, y = WEOC, color = Location))+
geom_boxplot(position = position_dodge(width = 0.5), width = 0.3, outlier.colour = NA)+
geom_jitter(size = 3, width = 0.3)+
labs(x = "")
plot WEOC by biome
bgc_data %>%
ggplot(aes(x = biome_name, y = WEOC, color = Location))+
geom_boxplot(position = position_dodge(width = 0.5), width = 0.3)+
geom_point(size = 3, position = position_dodge(width = 0.5))+
scale_x_discrete(labels = function(x) str_wrap(x, width = 10))+
labs(x = "")
The {ggplot2} package has functionality to easily plot
maps (without downloading any other files or packages!).
We will plot our data over a map of the USA.
To do so, first load the states_map file, which contains
the data for the state outlines.
Then, plot the state boundaries using the geom_polygon
function.
states_map <- ggplot2::map_data("state")
ggplot(states_map, aes(long, lat, group = group))+
geom_polygon(color = "grey", fill = NA)
Now, to plot our data over this map.
We add the geom_point layer over the base polygon layer of
the states.
ggplot(states_map, aes(long, lat, group = group))+
geom_polygon(color = "grey", fill = NA)+
geom_point(data = bgc_data,
aes(x = Long, y = Lat,
fill = TC, group = 1),
size = 6, shape = 21, color = "black")
The default {ggplot2} color palette may be difficult to
interpret.
As an alternative, we use the {soilpalettes} package.
ggplot(states_map, aes(long, lat, group = group))+
geom_polygon(color = "grey", fill = NA)+
geom_point(data = bgc_data %>% filter(Location == "TOP"),
aes(x = Long, y = Lat, color = TC, group = 1),
size = 6)+
scale_color_gradientn(colors = soil_palette("redox2", 5))+
labs(color = "TC, %")+
theme_void(base_size = 16)+
theme(legend.position = "bottom")
ggplot(states_map, aes(long, lat, group = group))+
geom_polygon(color = "grey", fill = NA)+
geom_point(data = bgc_data %>% filter(Location == "TOP"),
aes(x = Long, y = Lat, color = WEOC, group = 1),
size = 6)+
scale_color_gradientn(colors = soil_palette("redox2", 5))+
labs(color = "WEOC, mg/g")+
theme_void(base_size = 16)+
theme(legend.position = "bottom")
ggplot(states_map, aes(long, lat, group = group))+
geom_polygon(color = "grey", fill = NA)+
geom_point(data = bgc_data %>% filter(Location == "TOP"),
aes(x = Long, y = Lat, color = CEC, group = 1),
size = 6)+
scale_color_gradientn(colors = soil_palette("redox2", 5))+
labs(color = "CEC, meq/100g")+
theme_void(base_size = 16)+
theme(legend.position = "bottom")
ggplot(states_map, aes(long, lat, group = group))+
geom_polygon(color = "grey", fill = NA)+
geom_point(data = bgc_data %>% filter(Location == "TOP"),
aes(x = Long, y = Lat, color = Clay, group = 1),
size = 6)+
scale_color_gradientn(colors = soil_palette("redox2", 5))+
labs(color = "Clay, %")+
theme_void(base_size = 16)+
theme(legend.position = "bottom")
Another map example, showing the different biomes
ggplot(states_map, aes(long, lat, group = group))+
geom_polygon(color = "grey", fill = NA)+
geom_point(data = bgc_data,
aes(x = Long, y = Lat,
color = biome_name, group = 1),
size = 6)+
scale_color_manual(values = soil_palette("redox2", 5))+
theme_void(base_size = 16)+
labs(fill = "")
We plot % sand-silt-clay on a texture triangle (ternary plot) to
determine the soil texture. We use the ggtern package for
this.
You first plot the base ternary triangle. You will need to load the
USDA dataset from ggtern.
data(USDA)
texture_base =
ggtern() +
geom_polygon(data = USDA,
aes(x = Sand, y = Clay, z = Silt, group = Label),
fill = NA, size = 0.3, alpha = 0.5, color = "grey30")+
theme_bw()+
theme_showarrows()+
theme_hidetitles()+
theme_clockwise()
texture_base
then overlay your data points
texture_base +
geom_point(data = bgc_data,
aes(x = Sand, y = Clay, z = Silt),
size = 4, color = "black")
Now, color the data points by CEC
texture_cec =
texture_base +
geom_point(data = bgc_data %>% filter(!is.na(CEC)),
aes(x = Sand, y = Clay, z = Silt, color = CEC),
size = 4, )+
scale_color_gradientn(colors = soil_palette("redox2", 5))+
labs(fill = "CEC (meq/100g)")
Now, we need to add the texture labels.
Create a new dataframe from the USDA dataset (included with
ggtern). Then add that to the previous plot.
USDA_text <-
USDA %>%
group_by(Label) %>%
summarise_if(is.numeric, mean, na.rm = TRUE) %>%
ungroup()
#textures_names<-
texture_cec +
geom_text(data = USDA_text,
aes(x = Sand, y = Clay, z = Silt, label = Label),
size = 4, color = "black")